Connect to database with project data
sqlite_file = 'lahmansbaseballdb.sqlite'
con = sqlite3.connect(sqlite_file)Course DS 250
Anastasia Yazvinskaya
Achievements in sports are always an analysis of several areas. Unfortunately, the overall percentage of victories for all time does not show ideal indicators for a player, since one could play only one game and the team could win, while another player will play 10 games and lose at least one of them, already put him in second place. Also important are other factors when comparing players and teams, which I considered in this project, during the study of data analysis through SQL queries.
Write an SQL query to create a new dataframe about baseball players who attended BYU-Idaho. The new table should contain five columns: playerID, schoolID, salary, and the yearID/teamID associated with each salary. Order the table by salary (highest to lowest) and print out the table in your report.
Based on the data we have, we see that three players were BYU-Idaho students, but only two of them have salary information available. Analyzing each of them we see that the player with ID stephga01 had a salary increase from 1997 to 2001, after which his salary fell. The player with ID lindsma01 from 2007 to 2009 playing for the FLO team had a small salary increase. But after moving to another team in 2010 his salary increased 4 times. And although in the end by 2014 his salary increased more than 2 times, after 2010 he changed 2 more teams until in 2013 he joined the CHA team with a salary decrease, which was compensated a year later in 2014.
q = '''
SELECT DISTINCT cp.playerID, sal.salary, sal.yearID, sal.teamID
FROM collegeplaying cp
JOIN schools sc ON cp.schoolID = sc.schoolID
LEFT JOIN salaries sal ON cp.playerID = sal.playerID
WHERE sc.name_full='Brigham Young University-Idaho'
ORDER BY sal.salary DESC
'''
table = pd.read_sql_query(q,con)
table| playerID | salary | yearID | teamID | |
|---|---|---|---|---|
| 0 | lindsma01 | 4000000.0 | 2014.0 | CHA |
| 1 | lindsma01 | 3600000.0 | 2012.0 | BAL |
| 2 | lindsma01 | 2800000.0 | 2011.0 | COL |
| 3 | lindsma01 | 2300000.0 | 2013.0 | CHA |
| 4 | lindsma01 | 1625000.0 | 2010.0 | HOU |
| 5 | stephga01 | 1025000.0 | 2001.0 | SLN |
| 6 | stephga01 | 900000.0 | 2002.0 | SLN |
| 7 | stephga01 | 800000.0 | 2003.0 | SLN |
| 8 | stephga01 | 550000.0 | 2000.0 | SLN |
| 9 | lindsma01 | 410000.0 | 2009.0 | FLO |
| 10 | lindsma01 | 395000.0 | 2008.0 | FLO |
| 11 | lindsma01 | 380000.0 | 2007.0 | FLO |
| 12 | stephga01 | 215000.0 | 1999.0 | SLN |
| 13 | stephga01 | 185000.0 | 1998.0 | PHI |
| 14 | stephga01 | 150000.0 | 1997.0 | PHI |
| 15 | catetr01 | NaN | NaN | None |
This three-part question requires you to calculate batting average (number of hits divided by the number of at-bats)
a. Write an SQL query that provides playerID, yearID, and batting average for players with at least 1 at bat that year. Sort the table from highest batting average to lowest, and then by playerid alphabetically. Show the top 5 results in your report.
In this list we see the top 5 players who have 100% batting in certain years. However, these figures may not be entirely correct, as they take into account players who had at least one batting and if it was successful, they became the leaders in the list.
| playerID | yearID | batting_avr | |
|---|---|---|---|
| 0 | aberal01 | 1957 | 1.0 |
| 1 | abernte02 | 1960 | 1.0 |
| 2 | abramge01 | 1923 | 1.0 |
| 3 | acklefr01 | 1964 | 1.0 |
| 4 | alanirj01 | 2019 | 1.0 |
b. Use the same query as above, but only include players with at least 10 at bats that year. Print the top 5 results.
In this list we see the top 5 players who have the highest batting average in certain years. By counting players who have at least 10 batting averages this year we have improved the accuracy of the table by looking at the results on multiple attempts.
| playerID | yearID | batting_avr | |
|---|---|---|---|
| 0 | carsoma01 | 2013 | 0.64 |
| 1 | nymanny01 | 1974 | 0.64 |
| 2 | altizda01 | 1910 | 0.60 |
| 3 | johnsde01 | 1975 | 0.60 |
| 4 | silvech01 | 1948 | 0.57 |
c. Now calculate the batting average for players over their entire careers (all years combined). Only include players with at least 100 at bats, and print the top 5 results.
In this list we see the top 5 players who have the highest batting average throughout their playing career. Considering players who have at least 100 batting averages throughout their career, we have improved the accuracy of the table by looking at the results over multiple attempts throughout their playing career.
| playerID | batting_avr | |
|---|---|---|
| 0 | cobbty01 | 0.37 |
| 1 | barnero01 | 0.36 |
| 2 | hornsro01 | 0.36 |
| 3 | jacksjo01 | 0.36 |
| 4 | meyerle01 | 0.36 |
Pick any two baseball teams and compare them using a metric of your choice (average salary, home runs, number of wins, etc). Write an SQL query to get the data you need, then make a graph using Lets-Plot to visualize the comparison. What do you learn?
I compared the top two teams by win percentage relative to their salaries (with available salary data). Among the top 2 teams by win percentage, OAC is in first place. However, regarding salaries, we see that their average salary is more than two times less than that of the number 2 team. This gap began in 1996. Before that year, the average salary in both teams was approximately the same, after which UAC reduced the salary, and BRS, on the contrary, increased it. However, this did not affect the results much. Although it is also worth separately considering the graph of win percentage for these years. Perhaps such a salary increase was strategically important for attracting more experienced players, which in general, in the history of the team, helped them become more stable in recent years.
q = '''
WITH bestteams AS (
SELECT t.teamID, tf.franchName,
SUM(t.W) * 1.0 / SUM(t.G) AS win_rate
FROM salaries slr
JOIN teams t ON slr.teamID = t.teamID
JOIN teamsfranchises tf ON t.teamID = tf.franchID
GROUP BY t.teamID
ORDER BY win_rate DESC
LIMIT 2
)
SELECT bt.franchName, slr.yearID as year, ROUND(AVG(slr.salary), 2) AS salary
FROM bestteams bt
JOIN salaries slr ON bt.teamID = slr.teamID
GROUP BY bt.teamID, slr.yearID
'''
table = pd.read_sql_query(q,con)
# Draw chart
chart = (ggplot(table, aes('year', 'salary', color='franchName')) + \
geom_line(size=1.5) + \
geom_point(size=3)) + \
scale_x_continuous(format="d") + \
ggtitle(f'Comparison of Two Best Teams by Salary') + \
labs(
x='Year',
y='Salary',
color='Team names')
chartComparison of Two Best Teams (by win rate) by Salary
Write an SQL query that provides a summary table showing the average salary for players in each position (e.g., pitcher, catcher, outfielder) across all yearslr. Include the following columns: * position * average_salary * total_players * highest_salary
The highest_salary column should display the highest salary ever earned by a player in that position. If no player in that position has a recorded salary, display “N/A” for the highest salary.
Additionally, create a new column called salary_category using a case statement: * If the average salary is above $1 million, categorize it as “High Salary.” * If the average salary is between $500,000 and $1 million, categorize it as “Medium Salary.” * Otherwise, categorize it as “Low Salary.”
Order the table by average salary in descending order.
Print the top 10 rows of this summary table.
All positions have a high salary on average. Although it is interesting to note that at their maximum there are 4 positions that had identical salaries.
q = '''
SELECT
fld.pos AS position,
ROUND(AVG(slr.salary), 2) AS avg_salary,
COUNT(DISTINCT slr.playerID) AS total_players,
CASE
WHEN MAX(slr.salary) IS NULL THEN 'N/A'
ELSE CAST(MAX(slr.salary) AS TEXT)
END AS highest_salary,
CASE
WHEN AVG(slr.salary) > 1000000 THEN 'High Salary'
WHEN AVG(slr.salary) BETWEEN 500000 AND 1000000 THEN 'Medium Salary'
ELSE 'Low Salary'
END AS salary_category
FROM fielding fld
JOIN salaries slr
ON fld.playerID = slr.playerID
GROUP BY fld.pos
ORDER BY avg_salary DESC
LIMIT 10
'''
table = pd.read_sql_query(q,con)
table| position | avg_salary | total_players | highest_salary | salary_category | |
|---|---|---|---|---|---|
| 0 | 1B | 2800540.86 | 1268 | 33000000.0 | High Salary |
| 1 | OF | 2640496.41 | 1796 | 28000000.0 | High Salary |
| 2 | P | 2357970.04 | 2872 | 33000000.0 | High Salary |
| 3 | 3B | 2193536.57 | 1115 | 33000000.0 | High Salary |
| 4 | SS | 2180598.07 | 747 | 33000000.0 | High Salary |
| 5 | 2B | 1811115.32 | 904 | 26187500.0 | High Salary |
| 6 | C | 1745414.69 | 482 | 23000000.0 | High Salary |
Calculate the average career length (in years) for players who have played at least one game. Then, identify the top 10 players with the longest careers (based on the number of years they played). Include their: * playerID * first_name * last_name * career_length
The career_length should be calculated as the difference between the maximum and minimum yearID for each player.
All players in the table have been playing for teams for over 25 years. The most experienced player has been playing for 36 years.
q = '''
SELECT
app.playerID,
ppl.nameFirst AS first_name,
ppl.nameLast AS last_name,
(MAX(app.yearID) - MIN(app.yearID) + 1) AS career_length
FROM appearances app
JOIN people ppl ON app.playerID = ppl.playerID
GROUP BY app.playerID
HAVING SUM(G_all) >= 1
ORDER BY career_length DESC
LIMIT 10
'''
table = pd.read_sql_query(q,con)
table| playerID | first_name | last_name | career_length | |
|---|---|---|---|---|
| 0 | altroni01 | Nick | Altrock | 36 |
| 1 | orourji01 | Jim | O'Rourke | 33 |
| 2 | minosmi01 | Minnie | Minoso | 32 |
| 3 | olearch01 | Charley | O'Leary | 31 |
| 4 | lathaar01 | Arlie | Latham | 30 |
| 5 | mcguide01 | Deacon | McGuire | 29 |
| 6 | eversjo01 | Johnny | Evers | 28 |
| 7 | jennihu01 | Hughie | Jennings | 28 |
| 8 | ryanno01 | Nolan | Ryan | 28 |
| 9 | streega01 | Gabby | Street | 28 |